Succinct data structures for flexible text retrieval systems

نویسنده

  • Kunihiko Sadakane
چکیده

We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idf scores there has been no efficient data structures for arbitrary patterns. Our new data structures support these queries using small space. The space is only 2/ times the size of compressed documents plus 10n bits for a document collection of length n, for any 0 < ≤ 1. This is much smaller than the previous O(n log n) bit data structures. Query time is O(m+q log n) for listing and computing tf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in a sense that they support queries for arbitrary patterns.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Succinct Data Structures for NLP-at-Scale

Succinct data structures involve the use of novel data structures, compression technologies, and other mechanisms to allow data to be stored in extremely small memory or disk footprints, while still allowing for efficient access to the underlying data. They have successfully been applied in areas such as Information Retrieval and Bioinformatics to create highly compressible in-memory search ind...

متن کامل

Cell probe lower bounds for succinct data structures

In this paper, we consider several static data structure problems in the deterministic cell probe model. We develop a new technique for proving lower bounds for succinct data structures, where the redundancy in the storage can be small compared to the informationtheoretic minimum. In fact, we succeed in matching (up to constant factors) the lower order terms of the existing data structures with...

متن کامل

Engineering a Distributed Full-Text Index

We present a distributed full-text index for big data applications in a distributed environment. The index can be used to answer different types of pattern matching queries (existential, counting and enumeration) and also be extended to answer document retrieval queries (counting, retrieve and top-k). We also show that succinct data structures are indeed useful for big data applications, as the...

متن کامل

A Framework for Dynamizing Succinct Data Structures

We present a framework to dynamize succinct data structures, to encourage their use over non-succinct versions in a wide variety of important application areas. Our framework can dynamize most stateof-the-art succinct data structures for dictionaries, ordinal trees, labeled trees, and text collections. Of particular note is its direct application to XML indexing structures that answer subpath q...

متن کامل

New algorithms on wavelet trees and applications to information retrieval

Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile que...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • J. Discrete Algorithms

دوره 5  شماره 

صفحات  -

تاریخ انتشار 2007